ABSTRACT


Identifying at-risk students in open online courses has
received increasing attention because the dropout rate is
unexpectedly high. Most prior studies have used machine
learning techniques to predict student dropout from features
extracted from students’ learning activity logs. However,
little work has treated dropout prediction as a sequence
classification problem, even though a student’s dropout
probability at the current time step is likely to depend on
her/his engagement at the previous time step. In this paper,
we therefore propose a nonlinear state space model to solve
this problem. We show how students’ latent states at
different time steps can be learned via this model, and
demonstrate through an experiment that it achieves higher
prediction accuracy than related methods.


1. INTRODUCTION


With the advent of open online courses on MOOC websites such
as edX, Coursera, and Khan Academy, high-quality education
can easily be accessed by students at low cost. However,
although many thousands of participants have enrolled in
these online courses, their dropout rate is far higher than
expected. As reported in [8], the average dropout rate of
current MOOCs is approximately 75%.


Identifying at-risk students by predicting their dropout
probability has thus become important and timely, given that
early prediction can help instructors provide proper support
to those students and retain their interest in learning. To
address this issue, some researchers have focused on
extracting features from students’ learning activities (such
as watching videos, working on assignments, and posting in
or viewing discussion forums) to build machine learning
models (such as support vector machine (SVM) [9] and logistic
regression (LR) [14]). However, they rarely considered that
students’ learning activities across different time steps
(e.g., weeks) might be interrelated and carry different
weights in making the prediction. For instance, recent
activities could better reflect a student’s degree of
engagement. If a student actively engages with a course in
the current week, it is more likely that s/he will continue
to engage with this course in the coming week. Otherwise, if
s/he becomes inactive, it may indicate that her/his interest
in the course has decreased. Recently, some approaches, such
as the one based on Hidden Markov Model (HMM) [2] and the one
based on Recurrent Neural Network (RNN) [12], have been
proposed to model students’ states over time, but they still
suffer from some issues: 1) the estimate of the next state
depends only on the current state; 2) the estimated states
are deterministic, which can lead to error propagation in
the estimation procedure; 3) the parameters of their models
are time-invariant.


In our work, we focus on predicting whether a student will
have activities in the coming week. We formulate this issue
as a sequential classification problem, and develop a
Nonlinear State Space Model (NSSM) [1] to solve it. NSSM has
several advantages. Firstly, it can be used to discover a
student’s latent state (i.e., engagement pattern) that
characterizes the student’s intention to perform certain
activities. The student’s dropout probability is then
computed based on the state estimated for that time.
Secondly, relative to HMM and RNN, NSSM takes into account
all of the current and previous states to estimate the next
state. It can also accommodate uncertainty, given that the
state in NSSM is a set of random variables with a
multivariate Gaussian distribution. Thirdly, the parameters
in NSSM are time-varying (i.e., they differ across time
steps), which makes it more flexible in modeling students’
dynamics.


In short, this paper makes two main contributions: 1) we
implement a Nonlinear State Space Model (NSSM) to address
the dropout prediction problem, which explicitly models
students’ latent states as they vary over time; 2) we
conduct an experiment to compare our method with related
ones, including logistic regression (LR), simultaneously
smoothed logistic regression (LR-SIM), and RNN with long
short-term memory cell (LSTM). The results show that our
method is more accurate in identifying at-risk students who
tend to drop out.


In the remainder, we first describe related work in Section 2,
and then present our methodology in Section 3. In Section 4,
we give experimental results. In Section 5, we conclude our
work and indicate its future directions.


2. RELATED WORK 


The high dropout rate that is prevalent in current MOOCs
has driven researchers to investigate how to identify
at-risk students who are likely to quit. They have
considered different features to build the prediction model,
such as those extracted from clickstream data (e.g., watching
a lecture video, posting to discussion forums, submitting an
assignment) [2, 5, 6, 9, 14], quiz performance [5, 6, 14],
centrality of students in discussion forums [15], and
sentiments of discussion forum posts [4].


As for the prediction model, some studies have applied
support vector machines (SVM) [9], logistic regression (LR)
[14], survival analysis techniques such as the Cox
proportional hazard model [15], and probabilistic soft logic
(PSL) [13]. However, their common limitation is that they
assume a student’s dropout probabilities at different time
steps are independent, which limits their applicability in
practice because a student’s state at one time is usually
influenced by her/his previous state.


Alternatively, [6] extended the logistic regression model
to smooth the dropout probabilities across weeks, with the
aim of minimizing the difference between the probabilities
predicted for successive weeks. [2] used a Hidden Markov
Model (HMM) to model students’ actions over time, which
encodes their behaviour features into a set of mutually
exclusive discrete states. [12] adopted a Recurrent Neural
Network (RNN) model with long short-term memory (LSTM)
cells, which is able to encode features into continuous
states. However, though RNN may have advantages over HMM, it
inherently suffers from error propagation because the
estimate of the current state depends only on the estimated
previous state.


In comparison, our model accounts for the uncertainty of
the estimated states by representing the state as random
variables drawn from a multivariate Gaussian distribution.
Moreover, we adopt the extended Kalman filter and smoother
for state estimation so as to take into account all observed
activities in the sequence, which makes it different from,
and potentially more effective than, HMM and RNN, where only
states at two consecutive time steps are related.


3. OUR METHODOLOGY 
3.1 Problem Statement 


As mentioned above, our goal is to estimate the probability 
that a student stops engaging with a course in the coming 
week, given her/his learning activities up to the current time 
step. 


The temporal prediction of dropout probability requires us
to assemble some features¹ for expressing time-varying
behavior of students. Therefore, we extract 28 typical
features for each week $t$, denoted as an N-dimensional vector
$\mathbf{x}_{i,t} \in \mathbb{R}^{N}$, by considering the
seven types of activity². The summarization of these
temporal features is listed in Table 1.

¹ Prior to model training, these features are normalized to
have mean 0 and variance 1, and the normalization parameters
(mean, standard deviation) are used for normalizing the
testing set.
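
The normalization described in footnote 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors’ code: the function names `fit_standardizer` and `standardize`, the guard for constant features, and the toy numbers are all hypothetical. The only point taken from the text is that the mean and standard deviation are computed on the training set and then reused on the testing set.

```python
import math

def fit_standardizer(rows):
    """Return per-feature (means, stds) computed on the training rows only."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        # Assumption: a constant feature keeps scale 1.0 to avoid division by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds

def standardize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

# Toy data: two training students and one testing student, two features each.
train = [[1.0, 10.0], [3.0, 10.0]]
test = [[5.0, 12.0]]
means, stds = fit_standardizer(train)
train_z = standardize(train, means, stds)
test_z = standardize(test, means, stds)  # reuses the training statistics
```

Reusing the training statistics keeps information from the testing set out of the preprocessing step.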


Table 1: List of features derived from each student’s
learning activities by the week $t$


Features            | Description
$x_1$               | The average number of activities per week by the week t.
$x_2$               | The total number of activities in week t.
$x_3$               | The average number of sessions per week by the week t.³
$x_4$               | The total number of sessions in week t.
$x_5$               | The average number of active days per week by the week t.⁴
$x_6$               | The total number of active days in week t.
$x_7$               | The average time consumption per week by the week t.
$x_8$               | The total time consumption in week t.
$x_9$ - $x_{15}$    | The average number of 7 different types of activity per week by the week t.
$x_{16}$ - $x_{22}$ | The total number of 7 different types of activity in week t.
$x_{23}$ - $x_{25}$ | The average number of videos watched, wiki viewed and problem attempted per session by the week t respectively.
$x_{26}$ - $x_{28}$ | The average number of videos watched, wiki viewed and problem attempted per session in week t respectively.


In consequence, we obtain a sequence
$(\mathbf{x}_{i,1}, \mathbf{x}_{i,2}, \ldots, \mathbf{x}_{i,n_i})$
for each student $i$ across $n_i$ weeks, as well as the
corresponding sequence of dropout labels
$(y_{i,1}, y_{i,2}, \ldots, y_{i,n_i})$. Here $n_i$ represents
the number of weeks during which student $i$ has engaged with
the course. Formally, for current week $t$, if there are
activities associated with student $i$ in the coming week,
her/his dropout label in week $t$ is assigned $y_{i,t} = 0$,
otherwise $y_{i,t} = 1$. We can then treat the dropout
prediction task as a sequential classification problem, for
which the student’s latent states evolving over time are not
directly observable. As illustrated in Figure 1, as the
course progresses, given student $i$’s features
$\mathbf{x}_{i,t}$ for the current week $t$ and his/her
previous state $\mathbf{s}_{i,t-1}$, we want to estimate the
student’s current state $\mathbf{s}_{i,t}$ and whether s/he
will continue engaging with the course in the coming week,
$y_{i,t}$.
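
The labeling rule above can be sketched in a few lines. This is an illustration only: the function name `dropout_labels` and the set-of-weeks input are hypothetical, weeks are assumed to be 1-based, and $n_i$ is assumed to be the last week in which the student has any activity.

```python
def dropout_labels(active_weeks):
    """Return [y_1, ..., y_{n_i}] for one student.

    active_weeks: set of 1-based week indices with at least one activity.
    y_t = 0 if the student has activity in week t + 1, otherwise y_t = 1.
    Assumption: n_i is taken to be the last active week.
    """
    n_i = max(active_weeks)
    return [0 if (t + 1) in active_weeks else 1 for t in range(1, n_i + 1)]

# A student active in weeks 1, 2 and 4 (inactive in week 3).
labels = dropout_labels({1, 2, 4})
```

For this toy student the labels are `[0, 1, 0, 1]`: week 2 is labeled as a dropout week because week 3 has no activity, and the final week is always labeled 1 under the stated assumption.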


3.2 Nonlinear State Space Model (NSSM) 


Specifically, we employ a nonlinear state space model (NSSM)
with continuous-valued states to summarize all the
information about a student’s past behavior. Formally, let
the vector $\mathbf{s}_{i,t} \in \mathbb{R}^{K}$ ($K < N$) be
the latent state of student $i$ in the $t$-th week, which
depends on the observed explanatory features
$\mathbf{x}_{i,t}$ and her/his previous state
$\mathbf{s}_{i,t-1}$, as follows:


$$\mathbf{s}_{i,t} = F \mathbf{s}_{i,t-1} + G \mathbf{x}_{i,t} + \mathbf{w}_{i,t} \qquad (1)$$


in which the matrix $F \in \mathbb{R}^{K \times K}$ transforms
the previous state into the current state, the matrix
$G \in \mathbb{R}^{K \times N}$ transforms the observed
features to reflect the current state, and $\mathbf{w}_{i,t}$
represents a diffusion variable which follows a multivariate
Gaussian with mean $\mathbf{0}$ and covariance $Q_{i,t}$
(i.e., $\mathbf{w}_{i,t} \sim \mathcal{N}(\mathbf{0}, Q_{i,t})$).
Note that the dimension of the state vector $K$ is usually
smaller than the dimension of the feature vector $N$. This
hyperparameter $K$ controls the complexity of the model, and
requires manual tuning to determine its optimal value.


[Figure 1. Timeline labels: course starts, course ends; previous week (t-1), current week (t).]

Figure 1: The illustration of MOOCs dropout prediction
problem and the graphical state space model. The dark blue
signifies an observed variable and the light blue signifies
a latent variable.


² The seven types of activity consist of watching lecture
videos, working on course’s problems, accessing course’s
modules, accessing course’s wiki, posting or viewing course’s
forum, navigating through courses, and closing course page.
³ The minimal elapsed time between two separate sessions is
set as 60 minutes.
⁴ The day that has at least one activity is treated as an
active day.


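In our work, we aim to infer the dropout probability
$\pi_{i,t}$ for student $i$ in week $t$, which can be
represented as a logistic regression:


$$\pi_{i,t} = \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t}) \qquad (2)$$

$$= \frac{1}{1 + \exp(-\mathbf{h}_t^{T} \mathbf{s}_{i,t} - \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t})} \qquad (3)$$


where $\mathbf{h}_t \in \mathbb{R}^{K \times 1}$ and
$\boldsymbol{\beta}_t \in \mathbb{R}^{N \times 1}$ are two
vectors of coefficients for the current state variable
$\mathbf{s}_{i,t}$ and input feature $\mathbf{x}_{i,t}$
respectively. In this model, the non-stationarity of student
dynamics is captured by the time-evolving state variable
$\mathbf{s}_{i,t}$ and the time-varying parameters
$\mathbf{h}_t$ and $\boldsymbol{\beta}_t$.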
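
To make the two model equations concrete, the following sketch evaluates one state transition (Eqn. 1) and the resulting dropout probability (Eqn. 2) for a toy student. It is an illustration under stated assumptions: the helper names, the isotropic noise level `q`, and all numbers are hypothetical, and plain Python lists stand in for vectors and matrices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def transition(F, G, s_prev, x_t, q=0.0, rng=random):
    """Eqn. 1: s_t = F s_{t-1} + G x_t + w_t, assuming w_t ~ N(0, q I)."""
    mean = [a + b for a, b in zip(matvec(F, s_prev), matvec(G, x_t))]
    return [m + rng.gauss(0.0, math.sqrt(q)) for m in mean]

def dropout_probability(h_t, beta_t, s_t, x_t):
    """Eqn. 2: pi_t = sigma(h_t^T s_t + beta_t^T x_t)."""
    return sigmoid(dot(h_t, s_t) + dot(beta_t, x_t))

# Toy setting with K = 2 latent dimensions and N = 3 features (made-up values).
F = [[1.0, 0.0], [0.0, 1.0]]
G = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.5]]
x = [1.0, -1.0, 1.0]
s = transition(F, G, s_prev=[0.0, 0.0], x_t=x)  # q = 0, so no noise is added
p = dropout_probability([1.0, -1.0], [0.0, 0.0, 0.0], s, x)
```

With `q = 0` the transition is deterministic, which makes it easy to check the arithmetic by hand before adding noise.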


3.3 Expectation Maximization

With the nonlinear state space model described in Eqn. 1
and Eqn. 2, we design an Expectation-Maximization (EM)
algorithm (see Algorithm 1) that iterates between state
estimation (E-step) and parameter estimation (M-step) [11].
The E-step makes use of the extended Kalman filter and
smoother to estimate states, and the M-step re-estimates the
parameters by maximizing the likelihood of all observed
data, in which each student’s state variables are replaced
by their posterior estimates from the extended Kalman
smoother.


3.3.1 Expectation Step 

In the expectation step, the expected mean of student state
$\mathbf{s}_{i,t}$ and its covariance $P_{i,t}$ are obtained
using the extended Kalman filter and smoother. Specifically,
given student $i$’s entire $t-1$ weeks’ observation sequence
$D_i^{(t-1)} = \{(\mathbf{x}_{i,1}, y_{i,1}), (\mathbf{x}_{i,2}, y_{i,2}), \ldots, (\mathbf{x}_{i,t-1}, y_{i,t-1})\}$,
the posterior mean and covariance of student state
$\mathbf{s}_{i,t-1}$ are assumed to be represented by
$E(\mathbf{s}_{i,t-1} \mid D_i^{(t-1)}) = \mathbf{s}_{i,t-1}^{(t-1)}$
and $\mathrm{Cov}(\mathbf{s}_{i,t-1} \mid D_i^{(t-1)}) = P_{i,t-1}^{(t-1)}$
respectively. The predicted student state
$\mathbf{s}_{i,t}^{(t-1)}$ and its covariance $P_{i,t}^{(t-1)}$
for $t = 1, 2, \ldots, n_i - 1, n_i$ can then be defined as:

$$\mathbf{s}_{i,t}^{(t-1)} = F \mathbf{s}_{i,t-1}^{(t-1)} + G \mathbf{x}_{i,t} \qquad (4)$$

$$P_{i,t}^{(t-1)} = F P_{i,t-1}^{(t-1)} F^{T} + Q_{i,t} \qquad (5)$$


Algorithm 1 EM algorithm for estimating latent student
state and model parameters.

Initialize each student’s starting state $\mathbf{s}_{i,0}$ and model parameters $\Theta = \{F, G, \mathbf{h}_t, \boldsymbol{\beta}_t\}$
repeat
    procedure E-step:
        Extended Kalman filter: For $t = 1, 2, \ldots, n_i - 1, n_i$, correct the student state $\mathbf{s}_{i,t}$ and its covariance $P_{i,t}$ by using Eqn. 10 and Eqn. 11 respectively.
        Extended Kalman smoother: For $t = n_i, n_i - 1, \ldots, 2, 1$, smooth the predicted student state $\mathbf{s}_{i,t}^{(t)}$ and covariance $P_{i,t}^{(t)}$ by using Eqn. 13 and Eqn. 14 respectively.
    end procedure
    procedure M-step:
        Update parameters of the model $\Theta$ via equations from Eqn. 17 to Eqn. 20.
    end procedure
until converged


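By following the extended Kalman filtering, the nonlinear
function $\sigma(\cdot)$ can be approximated by its Taylor
series expansion as follows:


$$\pi_{i,t} = \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t}) \approx \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t}^{(t-1)} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t}) + A_{i,t} (\mathbf{s}_{i,t} - \mathbf{s}_{i,t}^{(t-1)}) \qquad (6)$$


$$A_{i,t} = \frac{\partial \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t})}{\partial \mathbf{s}_{i,t}} = \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t}^{(t-1)} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t}) \left(1 - \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t}^{(t-1)} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t})\right) \mathbf{h}_t^{T} \qquad (7)$$


The one-step ahead prediction $\pi_{i,t}^{(t-1)}$ for the
dropout probability is computed as:

$$\pi_{i,t}^{(t-1)} = \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t}^{(t-1)} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t}) \qquad (8)$$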


For the sake of simplicity, we set the state noise covariance
as $Q_{i,t} = q_{i,t} I$, where the state noise variance
$q_{i,t}$ is computed via:

$$q_{i,t} = \max\{u_{i,t}^{2} - \mu_{i,t},\ 0\} \qquad (9)$$

in which $\mu_{i,t} = \pi_{i,t}^{(t-1)} (1 - \pi_{i,t}^{(t-1)})$.
After receiving a new observation
$(\mathbf{x}_{i,t}, y_{i,t})$, the predicted state
$\mathbf{s}_{i,t}^{(t-1)}$ in Eqn. 4 and covariance
$P_{i,t}^{(t-1)}$ in Eqn. 5 will be corrected as:
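$$\mathbf{s}_{i,t}^{(t)} = \mathbf{s}_{i,t}^{(t-1)} + K_{i,t} \left(y_{i,t} - \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t}^{(t-1)} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t})\right) \qquad (10)$$

$$P_{i,t}^{(t)} = (I - K_{i,t} A_{i,t}) P_{i,t}^{(t-1)} \qquad (11)$$


in which $K_{i,t}$ is the Kalman gain calculated according to [3]:


$$K_{i,t} = P_{i,t}^{(t-1)} A_{i,t}^{T} \left(A_{i,t} P_{i,t}^{(t-1)} A_{i,t}^{T} + Q_{i,t}\right)^{-1} \qquad (12)$$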
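
One pass of the filter (prediction in Eqns. 4-5, linearization in Eqns. 7-8, correction in Eqns. 10-12) can be sketched as below. This is a hedged illustration rather than the authors’ implementation. Because the observation $y_{i,t}$ is a scalar, the term inside the inverse of Eqn. 12 is treated here as a scalar, and the hypothetical argument `r` stands in for the noise term added there; `q` is the scalar state noise variance with $Q = qI$. All helper names and toy numbers are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[dot(row, col) for col in zip(*B)] for row in A]

def ekf_step(s_prev, P_prev, x_t, y_t, F, G, h_t, beta_t, q, r):
    K = len(s_prev)
    # Prediction (Eqns. 4-5) with Q = q * I.
    s_pred = [a + b for a, b in zip(matvec(F, s_prev), matvec(G, x_t))]
    P_pred = matmul(matmul(F, P_prev), transpose(F))
    for k in range(K):
        P_pred[k][k] += q
    # One-step-ahead probability (Eqn. 8) and Jacobian row A (Eqn. 7).
    pi_pred = sigmoid(dot(h_t, s_pred) + dot(beta_t, x_t))
    A = [pi_pred * (1.0 - pi_pred) * h for h in h_t]
    # Kalman gain (Eqn. 12); A P A^T is a scalar, `r` is an assumed scalar noise term.
    PAt = matvec(P_pred, A)
    gain = [v / (dot(A, PAt) + r) for v in PAt]
    # Correction of mean and covariance (Eqns. 10-11).
    s_filt = [s + g * (y_t - pi_pred) for s, g in zip(s_pred, gain)]
    AP = matvec(transpose(P_pred), A)
    P_filt = [[P_pred[a][b] - gain[a] * AP[b] for b in range(K)] for a in range(K)]
    return s_filt, P_filt, pi_pred

# Toy one-dimensional example (K = 1, N = 1) with made-up values.
s_filt, P_filt, pi_pred = ekf_step(
    s_prev=[0.0], P_prev=[[1.0]], x_t=[0.0], y_t=1,
    F=[[1.0]], G=[[0.0]], h_t=[1.0], beta_t=[0.0], q=0.0, r=0.25)
```

In the toy call the predicted probability is 0.5, the observed label is 1, and the state mean moves from 0 to 0.4 while its variance shrinks from 1 to 0.8.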


It is worth noting that the predicted state
$\mathbf{s}_{i,t}^{(t)}$ and covariance $P_{i,t}^{(t)}$ in the
Kalman filter are estimated based on the observations
$D_i^{(t)}$ up to week $t$. We take advantage of the extended
Kalman smoother to smooth the estimated states by
considering the entire sequence of the student’s
observations $D_i^{(n_i)}$. The smoothed states could hence
be more accurate than the filtered ones. Specifically, the
student state $\mathbf{s}_{i,t}^{(t)}$ and covariance
$P_{i,t}^{(t)}$ for $t = n_i, n_i - 1, \ldots, 1$ are
smoothed as:


$$\mathbf{s}_{i,t-1}^{(n_i)} = \mathbf{s}_{i,t-1}^{(t-1)} + J_{i,t-1} \left(\mathbf{s}_{i,t}^{(n_i)} - F \mathbf{s}_{i,t-1}^{(t-1)} - G \mathbf{x}_{i,t}\right) \qquad (13)$$

$$P_{i,t-1}^{(n_i)} = P_{i,t-1}^{(t-1)} + J_{i,t-1} \left(P_{i,t}^{(n_i)} - P_{i,t}^{(t-1)}\right) J_{i,t-1}^{T} \qquad (14)$$

where $J_{i,t-1}$ is the smoothing gain defined as:


$$J_{i,t-1} = P_{i,t-1}^{(t-1)} F^{T} \left(P_{i,t}^{(t-1)}\right)^{-1} \qquad (15)$$


Note that the initial values $\mathbf{s}_{i,n_i}^{(n_i)}$ and
$P_{i,n_i}^{(n_i)}$ for the smoother are the final estimates
of the filter.
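
The backward recursion in Eqns. 13-15 can be sketched for the simplest case of a scalar state ($K = 1$) and a single scalar feature, so that every matrix reduces to a number. This simplification, the function name `smooth_scalar`, and the toy inputs are assumptions made for illustration only.

```python
def smooth_scalar(filt_means, filt_vars, xs, F, G, q):
    """Backward smoothing pass for a scalar state.

    filt_means[t], filt_vars[t]: filtered mean and variance for week t
    (0-based lists); xs[t]: scalar feature for week t; q: state noise variance.
    """
    n = len(filt_means)
    sm_means, sm_vars = list(filt_means), list(filt_vars)  # last entry = filter output
    for t in range(n - 1, 0, -1):
        pred_mean = F * filt_means[t - 1] + G * xs[t]        # Eqn. 4
        pred_var = F * filt_vars[t - 1] * F + q               # Eqn. 5
        J = filt_vars[t - 1] * F / pred_var                   # Eqn. 15
        sm_means[t - 1] = filt_means[t - 1] + J * (sm_means[t] - pred_mean)   # Eqn. 13
        sm_vars[t - 1] = filt_vars[t - 1] + J * (sm_vars[t] - pred_var) * J   # Eqn. 14
    return sm_means, sm_vars

# Two weeks of made-up filtered estimates.
sm_means, sm_vars = smooth_scalar(
    filt_means=[0.0, 1.0], filt_vars=[1.0, 0.5], xs=[0.0, 0.0], F=1.0, G=0.0, q=1.0)
```

In this toy run the week-1 estimate is pulled halfway toward the week-2 estimate (mean 0.5) and its variance drops from 1.0 to 0.625, which is the effect the text attributes to using the whole observation sequence.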


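3.3.2 Maximization Step
At the maximization step, given the observed data $D$ of $N$
students, the likelihood is defined as


$$\begin{aligned}
\mathcal{L}(D \mid \Theta) ={}& \sum_{i=1}^{N} \sum_{t=1}^{n_i} \Big[ y_{i,t} \log \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t}^{(n_i)} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t}) + (1 - y_{i,t}) \log\big(1 - \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t}^{(n_i)} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t})\big) \Big] \\
& - \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{n_i} \big(\mathbf{s}_{i,t}^{(n_i)} - F \mathbf{s}_{i,t-1}^{(n_i)} - G \mathbf{x}_{i,t}\big)^{T} Q_{i,t}^{-1} \big(\mathbf{s}_{i,t}^{(n_i)} - F \mathbf{s}_{i,t-1}^{(n_i)} - G \mathbf{x}_{i,t}\big) \\
& - \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{n_i} \log |Q_{i,t}|
\end{aligned} \qquad (16)$$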


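By using the posterior hidden state variables
$\mathbf{s}_{i,t}^{(n_i)}$ from the Kalman smoother, the
optimal parameters $\Theta = \{G, F, \mathbf{h}_t, \boldsymbol{\beta}_t\}$
can be obtained by maximizing the likelihood defined in
Eqn. 16. We then apply the gradient-based method L-BFGS [10]
to update the model parameters by using the following
derivatives respectively:


$$\frac{\partial \mathcal{L}}{\partial F} = \sum_{i=1}^{N} \sum_{t=1}^{n_i} Q_{i,t}^{-1} \big(\mathbf{s}_{i,t}^{(n_i)} - F \mathbf{s}_{i,t-1}^{(n_i)} - G \mathbf{x}_{i,t}\big) \big(\mathbf{s}_{i,t-1}^{(n_i)}\big)^{T} \qquad (17)$$

$$\frac{\partial \mathcal{L}}{\partial G} = \sum_{i=1}^{N} \sum_{t=1}^{n_i} Q_{i,t}^{-1} \big(\mathbf{s}_{i,t}^{(n_i)} - F \mathbf{s}_{i,t-1}^{(n_i)} - G \mathbf{x}_{i,t}\big) \mathbf{x}_{i,t}^{T} \qquad (18)$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} = \sum_{i=1}^{N} \big(y_{i,t} - \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t}^{(n_i)} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t})\big) \mathbf{s}_{i,t}^{(n_i)} \qquad (19)$$

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}_t} = \sum_{i=1}^{N} \big(y_{i,t} - \sigma(\mathbf{h}_t^{T} \mathbf{s}_{i,t}^{(n_i)} + \boldsymbol{\beta}_t^{T} \mathbf{x}_{i,t})\big) \mathbf{x}_{i,t} \qquad (20)$$

In Eqn. 19 and Eqn. 20, the sum runs over the students who
have an observation in week $t$, since $\mathbf{h}_t$ and
$\boldsymbol{\beta}_t$ only enter the likelihood through the
terms of that week.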
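
The week-specific gradients with respect to $\mathbf{h}_t$ and $\boldsymbol{\beta}_t$ (Eqn. 19 and Eqn. 20) have the familiar logistic-regression form: the prediction error times the corresponding input. A minimal sketch for one week, assuming the smoothed states are already available as plain lists, is given below; the function name and toy data are hypothetical, and the result would be handed to an optimizer such as the L-BFGS method mentioned in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def grad_h_beta(states, feats, labels, h_t, beta_t):
    """Gradients of the log-likelihood for one week t.

    states[i], feats[i], labels[i]: smoothed state, features and 0/1 label
    of student i in week t. Returns (dL/dh_t, dL/dbeta_t).
    """
    grad_h = [0.0] * len(h_t)
    grad_beta = [0.0] * len(beta_t)
    for s, x, y in zip(states, feats, labels):
        err = y - sigmoid(dot(h_t, s) + dot(beta_t, x))  # y - sigma(.)
        grad_h = [g + err * v for g, v in zip(grad_h, s)]
        grad_beta = [g + err * v for g, v in zip(grad_beta, x)]
    return grad_h, grad_beta

# Two made-up students with K = 1 and N = 1.
grad_h, grad_beta = grad_h_beta(
    states=[[1.0], [-1.0]], feats=[[1.0], [1.0]], labels=[1, 0],
    h_t=[0.0], beta_t=[0.0])
```

With zero coefficients both students are predicted at 0.5, so the errors are +0.5 and -0.5; they reinforce each other along the state dimension and cancel along the shared feature.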


Initialization of the EM Algorithm: The initial value
of parameters $\Theta$ should be chosen with care, otherwise
the EM algorithm may not converge. In our experiment, the
matrix $G$ is initially set to the transform matrix resulting
from the principal component analysis (PCA) algorithm [7],
and the matrix $F$ is set to an identity matrix.


4. EXPERIMENT


In order to evaluate the performance of our proposed model, 
we conducted an experiment on a real-life dataset. 


4.1 Dataset 

We use a dataset collected from xuetangX⁵, one of the
largest MOOC platforms in China. This dataset was released
for KDD CUP 2015⁶. The dataset, as shown in Table 2,
includes 79,186 students, each of whom enrolled in at least
one of the 39 courses. Each enrollment is associated with a
log of the student’s activities, including watching lecture
videos, working on the course’s problems, accessing the
course’s modules, and so on. In total, there are 8,157,277
activity logs, and the longest lifetime of an enrollment is
5 weeks.


Table 2: Statistics of xuetangX dataset for the experiment


Item                             | Statistical description
# courses                        | 39
# students                       | 79,186
# enrollments                    | 120,542
# activity logs                  | 8,157,277
Longest lifetime of enrollment   | 5 weeks


[Figure 2. Legend: #students, #dropouts, dropout rate.]

Figure 2: The number of students, number of
dropouts, and the dropout rate in different weeks.


As shown in Figure 2, we observe that 76,123 students
dropped out in the first week. Another observation is that
the longer a student has engaged with the course, the less
likely s/he is to quit the course. For example, the dropout
rate of students who have engaged with the courses for 5
weeks is 10.05%, vs. 63.15% for 1 week.


4.2 Evaluation Metrics 

Due to the class imbalance phenomenon, we use the Area Under
the Receiver Operating Characteristic Curve (AUC) as the
evaluation metric, as it is insensitive to class imbalance.
Concretely, AUC is the probability that a classifier ranks a
randomly chosen positive sample above a randomly chosen
negative one. An AUC of 1 indicates perfect discrimination,
whereas 0.5 corresponds to a classifier that guesses
randomly.
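
The pairwise reading of AUC can be computed directly on small examples. The sketch below is illustrative only: the function name `auc` and the toy scores are hypothetical, ties are counted as half a win, and the quadratic loop is meant for clarity rather than for datasets of the size used here.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive sample
    receives the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Four made-up students: label 1 = dropout, score = predicted dropout probability.
example = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Here three of the four positive-negative pairs are ordered correctly, so the value is 0.75. Changing the ratio of positives to negatives changes the number of pairs but not this interpretation.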


⁵ http://www.xuetangx.com
⁶ http://www.kddcup2015.com


4.3 Compared Methods


We compared our model with related methods: 


• Logistic Regression (LR) [14]: In this method, a logistic
  regression classifier is trained to make dropout prediction
  for each week. Specifically, for a student $i$ in week $t$,
  his/her dropout probability is computed as the logistic
  function of the weighted sum of input features
  $\mathbf{x}_{i,t}$:

  $$P(y_{i,t} \mid \mathbf{x}_{i,t}, \mathbf{w}_t) = \frac{1}{1 + \exp(-y_{i,t} \mathbf{w}_t^{T} \mathbf{x}_{i,t})} \qquad (21)$$

  where $\mathbf{w}_t = [w_{t1}, w_{t2}, \ldots, w_{tN}]$ is the
  weight vector to be learned. The objective function for
  week $t$ is

  $$\mathcal{L}(\mathbf{w}_t) = \sum_{i \in N_t} \log\big(1 + \exp(-y_{i,t} \mathbf{w}_t^{T} \mathbf{x}_{i,t})\big) + \frac{\lambda_1}{2} \|\mathbf{w}_t\|^{2} \qquad (22)$$

  where $N_t$ is the set of students who engage with the
  course in week $t$ and $\lambda_1 > 0$ is the regularization
  parameter for $\mathbf{w}_t$.


• Simultaneously Smoothed Logistic Regression (LR-SIM) [6]:
  It extends the logistic regression by smoothing the
  predicted dropout probabilities across consecutive weeks.
  In this model, a regularization term is added into the
  objective function to minimize the difference of the
  predicted probabilities between two adjacent weeks, such as
  $\mathbf{w}_t^{T} \mathbf{x}_{i,t}$ and
  $\mathbf{w}_{t-1}^{T} \mathbf{x}_{i,t-1}$. A new feature
  space $\mathbf{x}'_{i,t}$ is introduced, which has
  $T \times N$ dimensions ($T$ is the total number of weeks),
  with the $t$-th component having $N$ features corresponding
  to the features in the original feature space
  $\mathbf{x}_{i,t}$ for week $t$, and the other $T - 1$
  components set to zeroes. Then, a single weight vector
  $\mathbf{w}$ is introduced, which also has $T \times N$
  dimensions corresponding to $\mathbf{x}'_{i,t}$. The final
  objective function is defined as:

  $$\mathcal{L}(\mathbf{w}) = \sum_{t=1}^{T} \sum_{i \in N_t} \log\big(1 + \exp(-y_{i,t} \mathbf{w}^{T} \mathbf{x}'_{i,t})\big) + \frac{\lambda_1}{2} \|\mathbf{w}\|^{2} + \lambda_2 \sum_{t=2}^{T} \sum_{i \in N_{t,t-1}} \|\mathbf{w}^{T} \mathbf{x}'_{i,t} - \mathbf{w}^{T} \mathbf{x}'_{i,t-1}\|^{2} \qquad (23)$$

  where $N_{t,t-1}$ is the set of students who engage with
  the course in both weeks $t$ and $t - 1$, and
  $\lambda_2 > 0$ is the regularization parameter for the
  difference of the resulting dropout probabilities between
  two adjacent weeks.


• RNN with Long Short-Term Memory Cell (LSTM) [12]:
  It uses a recurrent neural network (RNN) model with
  long short-term memory (LSTM) architecture to train
  a sequence classifier model that produces temporal
  prediction. Similar to our proposed model, given the
  student’s week-by-week features and dropout labels
  $\{(\mathbf{x}_{i,t}, y_{i,t}),\ 1 \le t \le n_i\}$, the LSTM
  model is applied to estimate the student state, which can
  then be used to predict the student’s future actions.
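
The two logistic-regression baselines differ only in their objective, which the following sketch evaluates on toy data. It is an illustration under stated assumptions: labels are taken to be in {-1, +1}, as the margin form of Eqn. 21 implies; every student is assumed to be observed in consecutive weeks starting from week 1; and the function names, the hypothetical `expand` helper for the block feature space, and all numbers are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lr_objective(w_t, X_t, y_t, lam1):
    """Eqn. 22 for a single week: logistic loss plus an L2 penalty."""
    loss = sum(math.log(1.0 + math.exp(-y * dot(w_t, x))) for x, y in zip(X_t, y_t))
    return loss + 0.5 * lam1 * dot(w_t, w_t)

def expand(x, t, T):
    """Place the N features of week t (1-based) in block t of a T*N vector."""
    N = len(x)
    out = [0.0] * (T * N)
    out[(t - 1) * N:t * N] = x
    return out

def lr_sim_objective(w, data, T, lam1, lam2):
    """Eqn. 23. data[i] lists the (features, label) pairs of student i
    for weeks 1..n_i."""
    loss = 0.5 * lam1 * dot(w, w)
    for weeks in data:
        prev_score = None
        for t, (x, y) in enumerate(weeks, start=1):
            score = dot(w, expand(x, t, T))
            loss += math.log(1.0 + math.exp(-y * score))
            if prev_score is not None:
                loss += lam2 * (score - prev_score) ** 2  # smoothing across weeks
            prev_score = score
    return loss

# One made-up student observed for T = 2 weeks with N = 2 features.
data = [[([1.0, 2.0], 1), ([3.0, 4.0], -1)]]
value = lr_sim_objective([1.0, 0.0, 0.0, 1.0], data, T=2, lam1=1.0, lam2=0.1)
```

Setting `lam2 = 0` removes the coupling between weeks, in which case the LR-SIM objective reduces to the sum of the per-week LR objectives.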


Note that we did not compare with the Hidden Markov Model
(HMM) based method [2] because it can be treated as a
special case of RNN in which the student state is
represented as a discrete variable. For all the compared
models, we used the same set of features as input (see
Table 1).


4.4 Results and Discussion

The main hyperparameter that determines the NSSM model’s
performance is the dimensionality of the student state $K$
(see Eqn. 1). We compared the performance of NSSM in terms
of AUC with varying dimension of latent state $K$, and
observed that the optimal value of $K$ in most cases is 12.
Therefore, in our experiment, we set $K$ to 12 to train the
NSSM model.


4.4.1 Single Course 


In this setting, we trained a separate model for each course.
To get sufficient data for training, we only consider the
popular courses that include more than 5,000 students. After
filtering, 6 popular courses are used in this experiment. As
students may enroll in a course at different time steps, we
select the 70% of students who enrolled in the course
earliest as the training data, and the remaining 30% of
students as the testing data.
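
A time-ordered 70/30 split of this kind can be sketched as follows. The function name `split_by_enrollment_time`, the (student, day) input format, and the toy enrollment days are assumptions for illustration; the text only states that the earliest 70% of enrolled students are used for training.

```python
def split_by_enrollment_time(enrollments, train_percent=70):
    """Split (student_id, enrollment_time) pairs so that the earliest
    train_percent% of enrollments form the training set."""
    ordered = sorted(enrollments, key=lambda pair: pair[1])
    cut = len(ordered) * train_percent // 100
    return ordered[:cut], ordered[cut:]

# Ten hypothetical students with made-up enrollment days.
days = [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]
enrollments = [("student_%d" % k, day) for k, day in enumerate(days)]
train, test = split_by_enrollment_time(enrollments)
```

Splitting by time rather than at random means that every testing student enrolled after all training students, which mirrors how a deployed model would be used.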


         | LR    | LR-SIM | LSTM  | NSSM
Week 1   | 0.812 | 0.886  | 0.891 | 0.900
Week 2   | 0.819 | 0.876  | 0.887 | 0.891
Week 3   | 0.807 | 0.854  | 0.861 | 0.870
Week 4   | 0.768 | 0.778  | 0.786 | 0.796
Week 5   | 0.673 | 0.679  | 0.689 | 0.702


Table 3: Performance comparison of LR, LR-SIM,
LSTM and NSSM in terms of average AUC on 6
popular courses.


Table 3 presents the average AUC scores across weeks for
the different models. The results indicate that the models
that consider dependence between consecutive weeks, such as
LR-SIM, LSTM and NSSM, achieve higher AUC scores than the
baseline LR model, which does not. For example, for the
first week, the AUC score of NSSM is 0.900, which is a 10.8%
improvement relative to that of the LR model (0.812).
Furthermore, we can see that the methods that model the
student’s states over time (i.e., LSTM and NSSM) achieve
higher AUC than LR and LR-SIM in most cases. More notably,
our proposed model NSSM performs consistently better than
LSTM, suggesting that the student states estimated by NSSM
are more predictive than those estimated by LSTM. We can
also observe that, for most of the models, the accuracy
during early weeks is higher than that of later weeks. This
implies that the dropout prediction task may become harder
with increasing lifetime of engagement, as there might be
various hidden reasons that cause a student to quit the
course.
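
The relative improvement quoted for week 1 follows from the Table 3 entries as (NSSM - LR) / LR. A short check, with a hypothetical helper name, is:

```python
def relative_improvement(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

# Week 1 of Table 3: NSSM = 0.900, LR = 0.812.
week1 = round(relative_improvement(0.900, 0.812), 1)
```

The result is 10.8, matching the figure in the text.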


4.4.2 Across Courses

In this setting, we are interested in evaluating whether the
proposed model trained on some courses can serve other
courses as well, for which we randomly select 70% of the
courses for training and the remaining 30% for testing. In
this experiment, we use all of the student data from the
training courses to train the model.


Table 4 shows the performance comparison. Conclusions
similar to those in Section 4.4.1 can be drawn. Specifically,
from this table, we can observe that our proposed model NSSM
still matches or outperforms the other models (LR, LR-SIM
and LSTM) in four of the five weeks; the exception is week 3,
where LR-SIM is 0.001 higher. For example, for the first
week, the AUC score of NSSM is 0.936, which is a 12%
improvement relative to that of the LR model (0.835).
Furthermore, we can see that the improvement of NSSM over
LSTM is slight, and the relative improvement during later
weeks is larger than that of early weeks (e.g., +0.51%
during week 4 vs. +0.44% during week 2). This observation
implies that NSSM has the potential to make better dropout
predictions than LSTM for students who have a longer
lifetime of engagement. In addition, as these results are
predictions made for students from new courses, they
suggest that our proposed model is capable of making better
dropout predictions in new courses, in comparison with the
other models.


         | LR    | LR-SIM | LSTM  | NSSM
Week 1   | 0.835 | 0.933  | 0.936 | 0.936
Week 2   | 0.911 | 0.915  | 0.915 | 0.919
Week 3   | 0.868 | 0.872  | 0.867 | 0.871
Week 4   | 0.782 | 0.784  | 0.785 | 0.789
Week 5   | 0.655 | 0.662  | 0.673 | 0.686


Table 4: Performance comparison of LR, LR-SIM,
LSTM and NSSM in terms of AUC on new courses
across weeks.


5. CONCLUSIONS AND FUTURE WORK 


In this paper, we have focused on identifying at-risk
students in online courses by making dropout predictions. We
take advantage of the nonlinear state space model (NSSM)
because it can discover a student’s latent state, which
characterizes the student’s intention to perform certain
activities. We conducted an experiment on a real-world
dataset, which demonstrates that our proposed model achieves
higher prediction accuracy than related methods. We also
showed that the NSSM model trained on data from some courses
can make dropout predictions for students in new courses.


However, the difference between NSSM and LSTM is slight,
possibly because the extended Kalman filter and smoother we
used in this paper may not be an optimal estimator.
Therefore, in the future, we will explore other advanced
algorithms (e.g., the Unscented Kalman filter) for
estimation in our nonlinear state space model. As a second
future direction, because the experiment presented in this
paper is limited to the xuetangX dataset, we plan to
evaluate our proposed model on datasets collected from other
MOOC platforms, such as edX and Coursera.